feat(kiloclaw): proactively refresh API keys approaching expiry #1049
pandemicsyn merged 9 commits into main
The reconciliation alarm now checks if the instance's API key expires within 7 days and, if the controller supports it, mints a fresh key, pushes it via the new /_kilo/env/patch endpoint, updates the Fly machine config (without restart via skip_launch), and persists the new expiry.

Key changes:
- Controller: POST /_kilo/env/patch with KILOCODE_API_KEY allowlist
- Fly client: skip_launch option on updateMachine
- Reconcile: reconcileApiKeyExpiry with version gating, mint, push, persist
- Config: isCalverAtLeast helper and PROACTIVE_REFRESH_THRESHOLD_MS constant
…controllers

Restructure reconcileApiKeyExpiry so the version check only gates the push-to-controller step, not the entire flow. When the controller is too old for /_kilo/env/patch, we still mint a fresh key, update the Fly machine config (triggering a restart), and persist to DO state.

Also:
- Reduce PROACTIVE_REFRESH_THRESHOLD_MS from 7 days to 3 days
- Guard against starting stopped machines (check Fly machine.state before deciding skipLaunch)
- Add test for stopped-machine safety guard
Code Review Summary
Status: 1 Issue Found | Recommendation: Address before merge
Issue Details: WARNING
Other Observations (not in diff): Issues found in unchanged code that cannot receive inline comments: N/A
Files Reviewed (15 files)
Reviewed by gpt-5.4-20260305 · 669,096 tokens
- Reorder: update Fly config (skipLaunch) before hot patch attempt so the key is durably persisted before we try the live push
- Forward minSecretsVersion from ensureEnvKey() to updateMachine to prevent secret propagation races on restart
- Use updateMachine without skipLaunch for restart instead of stop+start to avoid leaving the machine stopped on partial failure
- Only persist new key/expiry to DO state when at least one delivery path succeeded (push or Fly config update)
- Make refresh threshold configurable via PROACTIVE_REFRESH_THRESHOLD_HOURS wrangler var (default 72h) for testing
- Reduce default threshold from 7 days to 3 days
Remove MIN_ENV_PATCH_CONTROLLER_VERSION, isCalverAtLeast(), and the getControllerVersion() pre-flight check. Instead, always try the push to /_kilo/env/patch — if the controller returns 404 (old image), the catch block handles it and falls through to the restart path. This eliminates a manually maintained calver constant that had to match the controller release date. The cost is one extra HTTP call per refresh event on old controllers (the 404), which is negligible since refresh only triggers once per key expiry cycle.
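The capability-detection approach described in this commit could look roughly like the sketch below. It is illustrative only: `tryPushApiKey`, `PushResult`, `FetchLike`, the JSON body shape, and the `baseUrl`/`bearerToken` parameters are assumptions and are not taken from the PR's code.

```typescript
// Hypothetical result type; the PR's real return shape is not shown in this thread.
type PushResult =
  | { pushed: true }
  | { pushed: false; reason: "unsupported" | "error"; status?: number };

// Minimal fetch-like signature so the sketch can be exercised without a network.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

async function tryPushApiKey(
  fetchFn: FetchLike,
  baseUrl: string, // assumed: origin of the controller
  bearerToken: string, // assumed: token accepted by the /_kilo/* routes
  apiKey: string,
): Promise<PushResult> {
  try {
    const res = await fetchFn(`${baseUrl}/_kilo/env/patch`, {
      method: "POST",
      headers: {
        authorization: `Bearer ${bearerToken}`,
        "content-type": "application/json",
      },
      // Assumed body shape: a flat object of env var names to values.
      body: JSON.stringify({ KILOCODE_API_KEY: apiKey }),
    });
    if (res.ok) return { pushed: true };
    // A 404 indicates a controller image that predates the endpoint.
    if (res.status === 404) {
      return { pushed: false, reason: "unsupported", status: 404 };
    }
    return { pushed: false, reason: "error", status: res.status };
  } catch {
    // Network failures are treated like any other failed push.
    return { pushed: false, reason: "error" };
  }
}
```

Returning a result object instead of throwing keeps the caller's fall-through logic simple: any `pushed: false` outcome is handled the same way, and the `reason` field is available for logging.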
Never force-restart the machine during key refresh. The Fly config is updated with skipLaunch (durable persist), the push is attempted for live delivery, and if the push fails the machine picks up the new key on its next natural restart (user-initiated, crash, deploy). This avoids any risk of downtime caused by the refresh process itself.
…t env patch

- Fetch controller version (best-effort) during API key refresh and include it in api_key_expiry_approaching, api_key_push_error, and api_key_refreshed log events for observability
- Add env patch endpoint checks to controller-smoke-test.sh
…vely-refresh-token

# Conflicts:
#   kiloclaw/src/durable-objects/kiloclaw-instance/gateway.ts
  }

  for (const [key, value] of Object.entries(validated)) {
    process.env[key] = value;
WARNING: The hot-patched key leaves KILO_API_KEY stale when the Kilo CLI feature is enabled
start-openclaw.sh aliases KILOCODE_API_KEY into KILO_API_KEY before launching the controller, but this route only updates process.env.KILOCODE_API_KEY. After SIGUSR1, the supervised gateway child respawns from the controller's current environment, so it still inherits the old KILO_API_KEY and the Kilo CLI auth plugin keeps using the expired token.
I'll fix up the CLI in a follow-up.
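For reference, one possible shape for such a follow-up is to mirror the patched value into any alias that is already set. This is a sketch under assumptions, not the PR's code: `applyEnvPatch`, `ALLOWED_KEYS`, and `ALIASES` are hypothetical names, and the environment is passed in as a plain object so the logic can be tested without touching `process.env`.

```typescript
// Allowlist as described in the PR; the alias map is an assumption based on the
// review comment that start-openclaw.sh aliases KILOCODE_API_KEY into KILO_API_KEY.
const ALLOWED_KEYS = new Set<string>(["KILOCODE_API_KEY"]);
const ALIASES: Record<string, string[]> = {
  KILOCODE_API_KEY: ["KILO_API_KEY"],
};

function applyEnvPatch(
  env: Record<string, string | undefined>,
  patch: Record<string, unknown>,
): string[] {
  // Validate everything first so a rejected patch never applies partially.
  for (const [key, value] of Object.entries(patch)) {
    if (!ALLOWED_KEYS.has(key)) {
      throw new Error(`env var not allowed: ${key}`);
    }
    if (typeof value !== "string" || value.length === 0) {
      throw new Error(`invalid value for ${key}`);
    }
  }

  const written: string[] = [];
  for (const [key, value] of Object.entries(patch)) {
    const str = value as string; // checked above
    env[key] = str;
    written.push(key);
    for (const alias of ALIASES[key] ?? []) {
      // Assumption: only refresh aliases that are already set, so the alias is
      // not introduced on instances where the Kilo CLI feature is disabled.
      if (env[alias] !== undefined) {
        env[alias] = str;
        written.push(alias);
      }
    }
  }
  return written;
}
```

In a controller this would be called with `process.env` before signalling the gateway, so a respawned child inherits both variables.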
…vely-refresh-token

# Conflicts:
#   kiloclaw/controller/src/index.ts
if (envOverrideHours) {
  const hours = Number(envOverrideHours);
  if (!Number.isNaN(hours) && hours > 0) {
    return hours * 60 * 60 * 1000;
WARNING: Thresholds at or above the token lifetime trigger refresh on every alarm
mintFreshApiKey() always issues 30-day tokens (KILOCODE_API_KEY_EXPIRY_SECONDS). Because this helper accepts any positive hour value, setting PROACTIVE_REFRESH_THRESHOLD_HOURS to 720 or larger means a newly minted key is still inside the threshold on the next 5-minute reconcile, so the worker will mint and push a fresh token forever. Clamp the override below the token lifetime or fall back to the default before returning it.
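A minimal sketch of the fall-back variant of this suggestion follows. The constant names are hypothetical; the 30-day lifetime and 72h default come from this thread. The helper's real signature is not shown in the diff, so taking the raw override string as a parameter is an assumption.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;
const DEFAULT_THRESHOLD_HOURS = 72; // 3-day default from the PR
const TOKEN_LIFETIME_HOURS = 30 * 24; // 720h: 30-day tokens, per the comment above

function getProactiveRefreshThresholdMs(envOverrideHours?: string): number {
  if (envOverrideHours) {
    const hours = Number(envOverrideHours);
    // Accept the override only if it is a positive, finite number strictly
    // below the token lifetime; otherwise a freshly minted key would already
    // be inside the threshold on the next alarm.
    if (Number.isFinite(hours) && hours > 0 && hours < TOKEN_LIFETIME_HOURS) {
      return hours * MS_PER_HOUR;
    }
  }
  return DEFAULT_THRESHOLD_HOURS * MS_PER_HOUR;
}
```

With this guard, an override of `720` or more is ignored in favor of the default, which also means a very large value can no longer be used to force a refresh of every running instance.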
Summary
Instances' API keys (JWTs) have a fixed expiry. Today, if a key expires while a sandbox is running, the gateway loses API access until the next full restart re-mints the key. This PR adds proactive refresh: the reconciliation alarm checks if the key expires within 3 days (configurable via the `PROACTIVE_REFRESH_THRESHOLD_HOURS` wrangler var) and mints a fresh one.

How the fresh key is delivered:
1. The Fly machine config is updated with `skipLaunch` (durable persist). This ensures the key survives cold starts regardless of whether the live push succeeds.
2. A live push to `process.env` via `POST /_kilo/env/patch` is attempted. If the controller supports it, `SIGUSR1` triggers a graceful in-process restart in OpenClaw: it drains active tasks (up to 90s), closes the server, and restarts the server loop, which re-reads `process.env` and picks up the new key.

No version gating; capability detection is used instead. The push is always attempted, and a 404 from old controllers is caught and handled gracefully.
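The delivery order above, together with the persist-only-on-success rule from the commit history, can be sketched as follows. This is an illustration under assumptions: `RefreshDeps`, `RefreshOutcome`, `needsRefresh`, and `refreshApiKey` are hypothetical names, and the real `reconcileApiKeyExpiry` is not reproduced here.

```typescript
// Hypothetical dependency interface standing in for the worker's real helpers.
interface RefreshDeps {
  mintFreshApiKey(): Promise<{ key: string; expiresAt: number }>;
  updateFlyConfig(key: string, opts: { skipLaunch: true }): Promise<void>;
  pushToController(key: string): Promise<void>;
  persist(key: string, expiresAt: number): Promise<void>;
}

interface RefreshOutcome {
  refreshed: boolean;
  pushed: boolean;
  flyConfigUpdated: boolean;
}

// True when the key expires within the threshold window.
function needsRefresh(expiresAt: number, now: number, thresholdMs: number): boolean {
  return expiresAt - now <= thresholdMs;
}

async function refreshApiKey(deps: RefreshDeps): Promise<RefreshOutcome> {
  const { key, expiresAt } = await deps.mintFreshApiKey();

  // 1. Durable path first: update the machine config without restarting.
  let flyConfigUpdated = false;
  try {
    await deps.updateFlyConfig(key, { skipLaunch: true });
    flyConfigUpdated = true;
  } catch {
    // Logged in the real flow; the live push is still attempted.
  }

  // 2. Live path: hot-patch the running controller (may 404 on old images).
  let pushed = false;
  try {
    await deps.pushToController(key);
    pushed = true;
  } catch {
    // The machine picks up the key on its next natural restart instead.
  }

  // 3. Persist only if at least one delivery path succeeded, so a total
  //    failure leaves the old expiry in place and the next alarm retries.
  if (pushed || flyConfigUpdated) {
    await deps.persist(key, expiresAt);
  }
  return { refreshed: pushed || flyConfigUpdated, pushed, flyConfigUpdated };
}
```

Keeping the stored expiry unchanged on total failure is what makes the retry automatic: the key still looks close to expiry on the next reconcile.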
Failure handling:
What changed:
- Controller (`POST /_kilo/env/patch`): accepts an allowlisted set of env vars (`KILOCODE_API_KEY`), writes them to `process.env`, and sends `SIGUSR1` to the gateway. Bearer-auth gated, same as the existing `/_kilo/config/*` routes.
- Fly client (`updateMachine`): added `skipLaunch` option, which updates the machine config without restarting.
- Reconcile (`reconcileApiKeyExpiry`): new step wired after `reconcileVolume`. Flow: mint → update Fly config (`skipLaunch`) → try push → persist only if at least one path succeeded. `minSecretsVersion` is forwarded from `ensureEnvKey()` to prevent secret propagation races.
- Config (`getProactiveRefreshThresholdMs`): reads the `PROACTIVE_REFRESH_THRESHOLD_HOURS` wrangler var with fallback to the 72h default. Set to a large value (e.g. `8760` = 1 year) to trigger refresh on all running instances for testing.

Verification
- `pnpm typecheck`: passes
- `pnpm test`: 566/566 tests pass (30 test files), including 26 new tests across 5 new/modified test files
- `pnpm lint`: passes

Visual Changes
Old controller:
New controller:
Reviewer Notes
- The push to `/_kilo/env/patch` is always attempted. Old controllers return 404, which is caught gracefully. No manual version constant to maintain.
- The Fly config is updated with `skipLaunch: true` and `minSecretsVersion` from `ensureEnvKey()`.
- State is persisted only when at least one delivery path succeeded (`pushed || flyConfigUpdated`). If both fail, the next alarm retries.
- `Promise.race` for the mint timeout clears the timer on success. Chosen over `AbortSignal.timeout` because Hyperdrive doesn't propagate abort signals.
- To test, set `PROACTIVE_REFRESH_THRESHOLD_HOURS` to `8760` (1 year) in wrangler vars: every running instance with a known expiry will refresh on its next alarm cycle (within 5 min).
- Log query: `tag:"reconcile" AND action:"api_key_*"`. Key events: `api_key_refreshed` (with `pushed` and `flyConfigUpdated` fields), `api_key_push_error`, `api_key_refresh_failed_all_paths`.
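The `Promise.race` timeout pattern mentioned in the notes can be sketched as below. `withTimeout` is a hypothetical helper name, not the PR's implementation; the point is that the timer is cleared whichever side of the race settles first, so no stray timer outlives a successful mint.

```typescript
async function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    // Runs on success, failure, and timeout alike; clearing an already-fired
    // timer is a no-op.
    clearTimeout(timer);
  }
}
```

A caller would wrap the mint call, for example `await withTimeout(mintFreshApiKey(), 10_000, "mint")`, where the 10-second budget is an illustrative value. The underlying work is not cancelled on timeout; only the wait is abandoned.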